BDAT 1004 - 02 PROBLEM SET 3

NAME: GOODNEWS AGBADU

STUDENT NUMBER: 200532679

QUESTION 1

Introduction: Special thanks to: https://github.com/justmarkham for sharing the dataset and materials. Occupations

Step 1. Import the necessary libraries

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called users

Step 4. Discover what is the mean age per occupation

Step 5. Discover the Male ratio per occupation and sort it from the most to the least

Step 6. For each occupation, calculate the minimum and maximum ages

Step 7. For each combination of occupation and sex, calculate the mean age

Step 8. For each occupation present the percentage of women and men

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

users = pd.read_csv(r'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
In [2]:
users.head()
Out[2]:
user_id age gender occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213

Step 4. Discover what is the mean age per occupation

In [3]:
mean_age = users.groupby("occupation").age.mean()
mean_age
Out[3]:
occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

Step 5. Discover the Male ratio per occupation and sort it from the most to the least

In [4]:
#group gender by the occupation
all_genders = users.groupby("occupation")["gender"].value_counts(normalize=True)*100

#Convert to a dataframe and rename columns
df_gender = pd.DataFrame(all_genders)
df_gender.rename(columns={"gender":"percent"}, inplace=True)

#Add index
index_gen = pd.DataFrame(df_gender.reset_index(level=["occupation", "gender"]))
index_gen.head()
Out[4]:
occupation gender percent
0 administrator M 54.430380
1 administrator F 45.569620
2 artist M 53.571429
3 artist F 46.428571
4 doctor M 100.000000
In [5]:
male_ratio = index_gen[index_gen["gender"] == "M"].sort_values(by="percent",ascending=False)
male_ratio
Out[5]:
occupation gender percent
4 doctor M 100.000000
7 engineer M 97.014925
37 technician M 96.296296
29 retired M 92.857143
27 programmer M 90.909091
11 executive M 90.625000
33 scientist M 90.322581
9 entertainment M 88.888889
17 lawyer M 83.333333
31 salesman M 75.000000
5 educator M 72.631579
35 student M 69.387755
25 other M 65.714286
21 marketing M 61.538462
39 writer M 57.777778
23 none M 55.555556
0 administrator M 54.430380
2 artist M 53.571429
20 librarian M 43.137255
14 healthcare M 31.250000
16 homemaker M 14.285714

Step 6. For each occupation, calculate the minimum and maximum ages

In [6]:
users.groupby("occupation").age.agg([min, max])
Out[6]:
min max
occupation
administrator 21 70
artist 19 48
doctor 28 64
educator 23 63
engineer 22 70
entertainment 15 50
executive 22 69
healthcare 22 62
homemaker 20 50
lawyer 21 53
librarian 23 69
marketing 24 55
none 11 55
other 13 64
programmer 20 63
retired 51 73
salesman 18 66
scientist 23 55
student 7 42
technician 21 55
writer 18 60

Step 7. For each combination of occupation and sex, calculate the mean age

In [7]:
users.groupby(['occupation','gender']).age.mean()
Out[7]:
occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.166667
               M         33.216667
retired        F         70.000000
               M         62.538462
salesman       F         27.000000
               M         38.555556
scientist      F         28.333333
               M         36.321429
student        F         20.750000
               M         22.669118
technician     F         38.000000
               M         32.961538
writer         F         37.631579
               M         35.346154
Name: age, dtype: float64

Step 8. For each occupation present the percentage of women and men

In [8]:
index_gen
Out[8]:
occupation gender percent
0 administrator M 54.430380
1 administrator F 45.569620
2 artist M 53.571429
3 artist F 46.428571
4 doctor M 100.000000
5 educator M 72.631579
6 educator F 27.368421
7 engineer M 97.014925
8 engineer F 2.985075
9 entertainment M 88.888889
10 entertainment F 11.111111
11 executive M 90.625000
12 executive F 9.375000
13 healthcare F 68.750000
14 healthcare M 31.250000
15 homemaker F 85.714286
16 homemaker M 14.285714
17 lawyer M 83.333333
18 lawyer F 16.666667
19 librarian F 56.862745
20 librarian M 43.137255
21 marketing M 61.538462
22 marketing F 38.461538
23 none M 55.555556
24 none F 44.444444
25 other M 65.714286
26 other F 34.285714
27 programmer M 90.909091
28 programmer F 9.090909
29 retired M 92.857143
30 retired F 7.142857
31 salesman M 75.000000
32 salesman F 25.000000
33 scientist M 90.322581
34 scientist F 9.677419
35 student M 69.387755
36 student F 30.612245
37 technician M 96.296296
38 technician F 3.703704
39 writer M 57.777778
40 writer F 42.222222
QUESTION 2

Euro Teams

Step 1. Import the necessary libraries

Step 2. Import the dataset from this address

Step 3. Assign it to a variable called euro12

In [80]:
import pandas as pd
import numpy as np
In [9]:
euro12 = pd.read_csv(r'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv')

euro12.head()
Out[9]:
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19

5 rows × 35 columns

Step 4. Select only the Goal column

In [10]:
euro12['Goals']
Out[10]:
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

Step 5. How many team participated in the Euro2012?

In [11]:
euro12.shape[0]
Out[11]:
16

Step 6. What is the number of columns in the dataset?

In [12]:
euro12.shape[1]
Out[12]:
35

Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

In [13]:
discipline = euro12[['Team','Yellow Cards','Red Cards']]
discipline
Out[13]:
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0

Step 8. Sort the teams by Red Cards, then to Yellow Cards

In [14]:
discipline.sort_values(['Red Cards','Yellow Cards'],ascending = False)
Out[14]:
Team Yellow Cards Red Cards
6 Greece 9 1
9 Poland 7 1
11 Republic of Ireland 6 1
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
0 Croatia 9 0
1 Czech Republic 7 0
14 Sweden 7 0
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0

Step 9. Calculate the mean Yellow Cards given per Team

In [15]:
round(discipline['Yellow Cards'].mean(),2)
Out[15]:
7.44

Step 10. Filter teams that scored more than 6 goals

In [16]:
euro12[euro12.Goals > 6]
Out[16]:
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18

2 rows × 35 columns

Step 11. Select the teams that start with G

In [17]:
(euro12[euro12['Team'].str.startswith('G')])
Out[17]:
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20

2 rows × 35 columns

Step 12. Select the first 7 columns

In [18]:
euro12.iloc[:,0:7]
Out[18]:
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38

Step 13. Select all columns except the last 3

In [19]:
euro12[euro12.columns[0:-3]]
Out[19]:
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

Step 14. Present only the Shooting Accuracy from England, Italy and Russia

In [20]:
sa = euro12.set_index('Team')
sa.loc[['England','Italy','Russia'],['Shooting Accuracy']]
Out[20]:
Shooting Accuracy
Team
England 50.0%
Italy 43.0%
Russia 22.5%
QUESTION 3

Housing

Step 1. Import the necessary libraries

In [21]:
import pandas as pd
import numpy as np
import random

Step 2.Create 3 differents Series, each of length 100, as follows:

The first a random number from 1 to 4

In [22]:
series1 = pd.Series(np.random.randint(1, high=5, size=100, dtype='l'))

The second a random number from 1 to 3

In [23]:
series2 = pd.Series(np.random.randint(1, high=4, size=100, dtype='l'))

The third a random number from 10,000 to 30,000

In [24]:
series3 = pd.Series(np.random.randint(10000, high=300001, size=100, dtype='l'))

Step 3.Create a DataFrame by joinning the Series by column

In [25]:
all_series = pd.concat([series1,series2,series3],axis=1)
all_series.head()
Out[25]:
0 1 2
0 4 1 244444
1 2 1 83801
2 4 2 288413
3 1 3 190856
4 4 2 145935

Step 4.Change the name of the columns to bedrs, bathrs, price_sqr_meter

In [26]:
all_series.rename(columns = {0:'bedrs', 1:'bathrs' ,2:'price_sqr_meter'}, inplace =True) 
all_series.head()
Out[26]:
bedrs bathrs price_sqr_meter
0 4 1 244444
1 2 1 83801
2 4 2 288413
3 1 3 190856
4 4 2 145935

Step 5.Create a one column DataFrame with the values of the 3 Series and assign it to 'bigcolumn'

In [27]:
#join the values using concat

one_col_df = pd.concat([series1,series2,series3],axis=0)

#convert table to a dataframe
bigcolumn = one_col_df.to_frame()

#rename the column
bigcolumn.rename(columns = {0:'All series'},inplace = True)

bigcolumn
Out[27]:
All series
0 4
1 2
2 4
3 1
4 4
... ...
95 266475
96 250411
97 111942
98 229190
99 39256

300 rows × 1 columns

Step 6. Ops it seems it is going only until index 99. Is it true?

In [28]:
#len(bigcolumn) or 

bigcolumn.count()
Out[28]:
All series    300
dtype: int64

Step 7. Reindex the DataFrame so it goes from 0 to 299

In [29]:
bigcolumn.reset_index(drop=True, inplace=True)
bigcolumn
Out[29]:
All series
0 4
1 2
2 4
3 1
4 4
... ...
295 266475
296 250411
297 111942
298 229190
299 39256

300 rows × 1 columns

QUESTION 4

Wind Statistics

The first three columns are year, month, and day. The remaining 12 columns are average windspeeds in knots at 12 locations in Ireland on that day.

Step 1. Import the necessary libraries Step 2. Import the dataset from the attached file wind.txt Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.

In [30]:
import pandas as pd
import datetime
In [31]:
# parse_dates gets 0, 1, 2 columns and parses them as the index
txt_data = 'wind.txt'
data = pd.read_csv(txt_data, sep = "\s+", parse_dates = [[0,1,2]]) 
data.head()
Out[31]:
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.

In [32]:
# fix_date uses datetime
def fix_date(x):
  year = x.year - 100 if x.year > 1989 else x.year
  return datetime.date(year, x.month, x.day)

# apply the function fix_date on the column and replace values
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_date)

data.head()
Out[32]:
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Step 5. Set the right dates as the index. Pay attention at the data type, it should be datetime64[ns].

In [33]:
#date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

#Set index
data = data.set_index('Yr_Mo_Dy')

data.head()
Out[33]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Step 6. Compute how many values are missing for each location over the entire record.They should be ignored in all calculations below.

In [34]:
data.isnull().sum()
Out[34]:
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

Step 7. Compute how many non-missing values there are in total.

In [35]:
data.notnull().sum()
Out[35]:
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times. A single number for the entire dataset.

In [36]:
data.sum().sum() / data.notna().sum().sum()
Out[36]:
10.227883764282181

Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days A different set of numbers for each location.

In [37]:
loc_stats = data.describe(percentiles=[])
loc_stats
Out[37]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
count 6568.000000 6571.000000 6572.000000 6569.000000 6572.000000 6574.000000 6571.000000 6572.000000 6571.000000 6573.000000 6574.000000 6570.000000
mean 12.362987 10.644314 11.660526 6.306468 10.455834 7.092254 9.797343 8.495053 8.493590 8.707332 13.121007 15.599079
std 5.618413 5.267356 5.008450 3.605811 4.936125 3.968683 4.977555 4.499449 4.166872 4.503954 5.835037 6.699794
min 0.670000 0.210000 1.500000 0.000000 0.130000 0.000000 0.000000 0.000000 0.000000 0.040000 0.130000 0.670000
50% 11.710000 10.170000 10.920000 5.750000 9.960000 6.830000 9.210000 8.080000 8.170000 8.290000 12.500000 15.000000
max 35.800000 33.370000 33.840000 28.460000 37.540000 26.160000 30.370000 31.080000 25.880000 28.210000 42.380000 42.540000

Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day. A different set of numbers for each day.

In [38]:
#create the dataframe
day_stats = pd.DataFrame()


day_stats['min'] = data.min(axis = 1) #min
day_stats['max'] = data.max(axis = 1) #max
day_stats['mean'] = data.mean(axis = 1) #mean
day_stats['standard_deviations'] = data.std(axis = 1) #standard deviations

day_stats.head()
Out[38]:
min max mean standard_deviations
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356

Step 11. Find the average windspeed in January for each location. Treat January 1961 and January 1962 both as January.

In [39]:
data.loc[data.index.month == 1].mean()
Out[39]:
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64

Step 12. Downsample the record to a yearly frequency for each location.

In [40]:
data.groupby(data.index.to_period('A')).mean()
Out[40]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961 12.299583 10.351796 11.362369 6.958227 10.881763 7.729726 9.733923 8.858788 8.647652 9.835577 13.502795 13.680773
1962 12.246923 10.110438 11.732712 6.960440 10.657918 7.393068 11.020712 8.793753 8.316822 9.676247 12.930685 14.323956
1963 12.813452 10.836986 12.541151 7.330055 11.724110 8.434712 11.075699 10.336548 8.903589 10.224438 13.638877 14.999014
1964 12.363661 10.920164 12.104372 6.787787 11.454481 7.570874 10.259153 9.467350 7.789016 10.207951 13.740546 14.910301
1965 12.451370 11.075534 11.848767 6.858466 11.024795 7.478110 10.618712 8.879918 7.907425 9.918082 12.964247 15.591644
1966 13.461973 11.557205 12.020630 7.345726 11.805041 7.793671 10.579808 8.835096 8.514438 9.768959 14.265836 16.307260
1967 12.737151 10.990986 11.739397 7.143425 11.630740 7.368164 10.652027 9.325616 8.645014 9.547425 14.774548 17.135945
1968 11.835628 10.468197 11.409754 6.477678 10.760765 6.067322 8.859180 8.255519 7.224945 7.832978 12.808634 15.017486
1969 11.166356 9.723699 10.902000 5.767973 9.873918 6.189973 8.564493 7.711397 7.924521 7.754384 12.621233 15.762904
1970 12.600329 10.726932 11.730247 6.217178 10.567370 7.609452 9.609890 8.334630 9.297616 8.289808 13.183644 16.456027
1971 11.273123 9.095178 11.088329 5.241507 9.440329 6.097151 8.385890 6.757315 7.915370 7.229753 12.208932 15.025233
1972 12.463962 10.561311 12.058333 5.929699 9.430410 6.358825 9.704508 7.680792 8.357295 7.515273 12.727377 15.028716
1973 11.828466 10.680493 10.680493 5.547863 9.640877 6.548740 8.482110 7.614274 8.245534 7.812411 12.169699 15.441096
1974 13.643096 11.811781 12.336356 6.427041 11.110986 6.809781 10.084603 9.896986 9.331753 8.736356 13.252959 16.947671
1975 12.008575 10.293836 11.564712 5.269096 9.190082 5.668521 8.562603 7.843836 8.797945 7.382822 12.631671 15.307863
1976 11.737842 10.203115 10.761230 5.109426 8.846339 6.311038 9.149126 7.146202 8.883716 7.883087 12.332377 15.471448
1977 13.099616 11.144493 12.627836 6.073945 10.003836 8.586438 11.523205 8.378384 9.098192 8.821616 13.459068 16.590849
1978 12.504356 11.044274 11.380000 6.082356 10.167233 7.650658 9.489342 8.800466 9.089753 8.301699 12.967397 16.771370

Step 13. Downsample the record to a monthly frequency for each location.

In [41]:
data.groupby(data.index.to_period('M')).mean()
Out[41]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01 14.841333 11.988333 13.431613 7.736774 11.072759 8.588065 11.184839 9.245333 9.085806 10.107419 13.880968 14.703226
1961-02 16.269286 14.975357 14.441481 9.230741 13.852143 10.937500 11.890714 11.846071 11.821429 12.714286 18.583214 15.411786
1961-03 10.890000 11.296452 10.752903 7.284000 10.509355 8.866774 9.644194 9.829677 10.294138 11.251935 16.410968 15.720000
1961-04 10.722667 9.427667 9.998000 5.830667 8.435000 6.495000 6.925333 7.094667 7.342333 7.237000 11.147333 10.278333
1961-05 9.860968 8.850000 10.818065 5.905333 9.490323 6.574839 7.604000 8.177097 8.039355 8.499355 11.900323 12.011613
... ... ... ... ... ... ... ... ... ... ... ... ...
1978-08 9.645161 8.259355 9.032258 4.502903 7.368065 5.935161 5.650323 5.417742 7.241290 5.536774 10.466774 12.054194
1978-09 10.913667 10.895000 10.635000 5.725000 10.372000 9.278333 10.790333 9.583000 10.069333 8.939000 15.680333 19.391333
1978-10 9.897742 8.670968 9.295806 4.721290 8.525161 6.774194 8.115484 7.337742 8.297742 8.243871 13.776774 17.150000
1978-11 16.151667 14.802667 13.508000 7.317333 11.475000 8.743000 11.492333 9.657333 10.701333 10.676000 17.404667 20.723000
1978-12 16.175484 13.748065 15.635161 7.094839 11.398710 9.241613 12.077419 10.194839 10.616774 11.028710 13.859677 21.371613

216 rows × 12 columns

Step 14. Downsample the record to a weekly frequency for each location.

In [42]:
data.groupby(data.index.to_period('W')).mean()
Out[42]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1960-12-26/1961-01-01 15.040000 14.960000 13.170000 9.290000 NaN 9.870000 13.670000 10.250000 10.830000 12.580000 18.500000 15.040000
1961-01-02/1961-01-08 13.541429 11.486667 10.487143 6.417143 9.474286 6.435714 11.061429 6.616667 8.434286 8.497143 12.481429 13.238571
1961-01-09/1961-01-15 12.468571 8.967143 11.958571 4.630000 7.351429 5.072857 7.535714 6.820000 5.712857 7.571429 11.125714 11.024286
1961-01-16/1961-01-22 13.204286 9.862857 12.982857 6.328571 8.966667 7.417143 9.257143 7.875714 7.145714 8.124286 9.821429 11.434286
1961-01-23/1961-01-29 19.880000 16.141429 18.225714 12.720000 17.432857 14.828571 15.528571 15.160000 14.480000 15.640000 20.930000 22.530000
... ... ... ... ... ... ... ... ... ... ... ... ...
1978-11-27/1978-12-03 14.934286 11.232857 13.941429 5.565714 10.215714 8.618571 9.642857 7.685714 9.011429 9.547143 11.835714 18.728571
1978-12-04/1978-12-10 20.740000 19.190000 17.034286 9.777143 15.287143 12.774286 14.437143 12.488571 13.870000 14.082857 18.517143 23.061429
1978-12-11/1978-12-17 16.758571 14.692857 14.987143 6.917143 11.397143 7.272857 10.208571 7.967143 9.168571 8.565714 11.102857 15.562857
1978-12-18/1978-12-24 11.155714 8.008571 13.172857 4.004286 7.825714 6.290000 7.798571 8.667143 7.151429 8.072857 11.845714 18.977143
1978-12-25/1978-12-31 14.951429 11.801429 16.035714 6.507143 9.660000 8.620000 13.708571 10.477143 10.868571 11.471429 12.947143 26.844286

940 rows × 12 columns

Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.

In [43]:
week = data.resample('W').agg(['min','max','mean','std'])

# slice first 52 weeks
week.loc[week.index[1:53], "RPT":"MAL"] .head(5)
Out[43]:
RPT VAL ROS ... CLO BEL MAL
min max mean std min max mean std min max ... mean std min max mean std min max mean std
Yr_Mo_Dy
1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062
1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355
1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239
1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721
1961-02-05 10.58 24.21 16.827143 5.251408 9.46 24.21 15.460000 5.187395 9.04 19.70 ... 9.460000 2.839501 9.17 19.33 14.012857 4.210858 7.17 19.25 11.935714 4.336104

5 rows × 48 columns

QUESTION 5

Step 1. Import the necessary libraries

In [44]:
import pandas as pd
import numpy as np

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called chipo.

Step 4. See the first 10 entries

In [45]:
data_url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

chipo = pd.read_csv(data_url, sep = '\t')
In [46]:
chipo.head(10)
Out[46]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $9.25
In [47]:
#removing dollar sign in item price
chipo['item_price'] = chipo['item_price'].str.replace('$', '')

chipo.head(5)
/var/folders/zx/z32cgc1s3cj0fh_nk6cynqh00000gn/T/ipykernel_51736/433152020.py:2: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  chipo['item_price'] = chipo['item_price'].str.replace('$', '')
Out[47]:
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN 2.39
1 1 1 Izze [Clementine] 3.39
2 1 1 Nantucket Nectar [Apple] 3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN 2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... 16.98

Step 5. What is the number of observations in the dataset?

In [48]:
chipo.shape[0]
Out[48]:
4622

Step 6. What is the number of columns in the dataset?

In [49]:
chipo.shape[1]
Out[49]:
5

Step 7. Print the name of all the columns.

In [50]:
chipo.columns
Out[50]:
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

Step 8. How is the dataset indexed?

In [51]:
chipo.index
Out[51]:
RangeIndex(start=0, stop=4622, step=1)

Step 9. Which was the most-ordered item?

In [52]:
most_ordered = chipo.groupby('item_name').sum().sort_values(['quantity'], ascending=False)
most_ordered.head(1)
Out[52]:
order_id quantity
item_name
Chicken Bowl 713926 761

Step 10. For the most-ordered item, how many items were ordered?

In [53]:
most_ordered_item = chipo.groupby('item_name').sum().sort_values(['quantity'], ascending=False)
most_ordered_item.head(1)
Out[53]:
order_id quantity
item_name
Chicken Bowl 713926 761

Step 11. What was the most ordered item in the choice_description column? Step 12. How many items were orderd in total?

In [54]:
most_ordered = chipo.groupby('choice_description').sum().sort_values(['quantity'], ascending=False)
most_ordered.head(1)
Out[54]:
order_id quantity
choice_description
[Diet Coke] 123455 159

Step 12. How many items were orderd in total?

In [55]:
chipo.quantity.sum()
Out[55]:
4972

Step 13.

• Turn the item price into a float

• Check the item price type

• Create a lambda function and change the type of item price

• Check the item price type

In [56]:
 item_price= lambda x: float(x[1:-1])
chipo.item_price = chipo.item_price.apply(item_price)
chipo.item_price
Out[56]:
0       0.39
1       0.39
2       0.39
3       0.39
4       6.98
        ... 
4617    1.75
4618    1.75
4619    1.25
4620    0.75
4621    0.75
Name: item_price, Length: 4622, dtype: float64
In [57]:
chipo.item_price.dtype
Out[57]:
dtype('float64')

Step 14. How much was the revenue for the period in the dataset? Step 15. How many orders were made in the period?

In [58]:
revenue = (chipo['quantity'] * chipo['item_price']).sum()
revenue
Out[58]:
5105.02

Step 15. How many orders were made in the period?

In [59]:
chipo.order_id.value_counts().count()
Out[59]:
1834

Step 16. What is the average revenue amount per order?

In [60]:
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped['item_price'].mean()
Out[60]:
2.201286804798244

Step 17. How many different items are sold?

In [61]:
chipo.item_name.value_counts().count()
Out[61]:
50
QUESTION 6

Create a line plot showing the number of marriages and divorces per capita in the U.S. between 1867 and 2014. Label both lines and show the legend. Don't forget to label your axes!

In [62]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [63]:
marriages_divorces = pd.read_csv('us-marriages-divorces-1867-2014.csv')
#marriages_divorces.info()
In [64]:
marriages_divorces.head()
Out[64]:
Year Marriages Divorces Population Marriages_per_1000 Divorces_per_1000
0 1867 357000.0 10000.0 36970000 9.7 0.3
1 1868 345000.0 10000.0 37885000 9.1 0.3
2 1869 348000.0 11000.0 38870000 9.0 0.3
3 1870 352000.0 11000.0 39905000 8.8 0.3
4 1871 359000.0 12000.0 41010000 8.8 0.3
In [65]:
years = marriages_divorces['Year']
marriages = marriages_divorces['Marriages_per_1000']
divorces = marriages_divorces['Divorces_per_1000']
marriages_divorces = plt.figure(figsize=(16,8))
marriages_divorces = plt.plot(years, marriages, label='Marriages per 1000')
marriages_divorces = plt.plot(years, divorces, label='Divorces per 1000')
marriages_divorces = plt.title("Number of marriages and divorces per capita in the U.S. between 1867 and 2014")
marriages_divorces = plt.xlabel("Years",fontsize=14)
marriages_divorces = plt.legend(fontsize = 12, loc = "upper left")
marriages_divorces = plt.ylabel("Marriages",fontsize=14)
marriages_divorces = plt.grid(True)
marriages_divorces
QUESTION 7

Create a vertical bar chart comparing the number of marriages and divorces per capita in the U.S. between 1900, 1950, and 2000. Don't forget to label your axes!

In [66]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
In [67]:
marriages_divorces1 = pd.read_csv('us-marriages-divorces-1867-2014.csv')
In [68]:
filterd = marriages_divorces1.loc[marriages_divorces1.Year.isin([1900,1950,2000]),['Year','Marriages_per_1000','Divorces_per_1000']]
filterd
Out[68]:
Year Marriages_per_1000 Divorces_per_1000
33 1900 9.3 0.7
83 1950 11.0 2.5
133 2000 8.2 3.3
In [69]:
#Stacked Bar Chart using plotly

fig = go.Figure(data=[
    go.Bar(name='Marriages_per_1000', x= filterd['Year'], y=filterd['Marriages_per_1000']),
    go.Bar(name='Divorces_per_1000', x=filterd['Year'], y=filterd['Divorces_per_1000'])
           ])
# Change the bar mode
fig.update_layout(barmode='stack',
                 title="Number of marriages and divorces per capita in the U.S. between 1900, 1950, and 2000",
     xaxis_title="Years",
    yaxis_title="Frequency",)
fig.show()
QUESTION 8

Create a horizontal bar chart that compares the deadliest actors in Hollywood. Sort the actors by their kill count and label each bar with the corresponding actor's name. Don't forget to label your axes!

In [70]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
In [71]:
dead = pd.read_csv('actor_kill_counts.csv')
dead
Out[71]:
Actor Count
0 Arnold Schwarzenegger 369
1 Chow Yun-Fat 295
2 Clint Eastwood 207
3 Clive Owen 194
4 Dolph Lundgren 239
5 Jet Li 201
6 Nicolas Cage 204
7 Sylvester Stallone 267
8 Tomisaburo Wakayama 226
9 Wesley Snipes 193
In [72]:
fig = go.Figure(go.Bar(
            x=dead["Count"],
            y=dead["Actor"],
            orientation='h'))

fig.update_layout(
    title="Deadliest actors in Hollywood",
     xaxis_title="Kill Count",
    yaxis_title="Actor Names",
       
        )

fig.show()
QUESTION 9

Create a pie chart showing the fraction of all Roman Emperors that were assassinated. Make sure that the pie chart is an even circle, labels the categories, and shows the percentage breakdown of the categories.

In [73]:
import pandas as pd
import numpy as np
import plotly.express as px
In [74]:
re = pd.read_csv("roman-emperor-reigns.csv")
In [75]:
re = re[re["Cause_of_Death"]=="Assassinated"]
re
Out[75]:
Emperor Length_of_Reign Cause_of_Death
2 Caligula 4.83 Assassinated
5 Galba 0.58 Assassinated
7 Vitellius 0.67 Assassinated
10 Domitian 15.00 Assassinated
17 Commodus 15.00 Assassinated
18 Pertinax 0.25 Assassinated
21 Caracalla 19.00 Assassinated
22 Geta 3.00 Assassinated
24 Elagabalus 3.75 Assassinated
25 Severus Alexander 13.00 Assassinated
26 Maximinus I 3.25 Assassinated
29 Pupienus 0.25 Assassinated
30 Balbinus 0.25 Assassinated
35 Trebonianus Gallus 2.00 Assassinated
36 Aemilian 0.16 Assassinated
38 Gallienus 15.00 Assassinated
41 Aurelian 5.00 Assassinated
43 Florian 0.25 Assassinated
44 Probus 6.00 Assassinated
52 Severus II 1.00 Assassinated
59 Constans I 13.00 Assassinated
65 Gratian 16.00 Assassinated
In [76]:
fig = px.pie(re, values=re.Length_of_Reign, names=re.Emperor)

fig.show()
QUESTION 10

Create a scatter plot showing the relationship between the total revenue earned by arcades and the number of Computer Science PhDs awarded in the U.S. between 2000 and 2009. Don't forget to label your axes! Color each dot according to its year.

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
In [78]:
phd = pd.read_csv("arcade-revenue-vs-cs-doctorates.csv")
phd
Out[78]:
Year Total Arcade Revenue (billions) Computer Science Doctorates Awarded (US)
0 2000 1.196 861
1 2001 1.176 830
2 2002 1.269 809
3 2003 1.240 867
4 2004 1.307 948
5 2005 1.435 1129
6 2006 1.601 1453
7 2007 1.654 1656
8 2008 1.803 1787
9 2009 1.734 1611
In [79]:
a = phd.groupby('Year')
for name, group in a:
    plot.plot(group['Total Arcade Revenue (billions)'], group['Computer Science Doctorates Awarded (US)'], marker='o', linestyle='', markersize=12, label=name)
plot.legend()
plot.xlabel("Total revenue earned by Arcades")
plot.ylabel("Computer Science Awards")
plot.figure(figsize=(20,15))
Out[79]:
<Figure size 1440x1080 with 0 Axes>
<Figure size 1440x1080 with 0 Axes>